This is the Introduction to Open Data Science course by Kimmo Vehkalahti & co.
This course consists of six sections, and every section has its own weekly exercises.
We are using GitHub to share our work. You can find my repository here.
You can find the code by pressing the "Code" button on the right ->
getwd()
## [1] "/Users/mirva/IODS-project"
setwd("/Users/mirva/IODS-project/data")
learning2014 = read.csv("learning2014")
dim(learning2014)
## [1] 166 8
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
This data is part of a dataset collected by Kimmo Vehkalahti between 3 December 2014 and 10 January 2015 on the course
Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014, taught in Finnish).
The data has 166 observations and 7 variables of interest (plus the row index X): gender, age, attitude, deep, stra, surf and points.
library(ggplot2)
library(GGally)
overw <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
overw
summary(learning2014)
## X gender age attitude deep
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. :1.583
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Median : 83.50 Median :22.00 Median :32.00 Median :3.667
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
All the variables are fairly normally distributed except age, whose minimum is 17, median 22 and maximum 55. The differences between men and women are also quite small, although the distributions are not exactly equal. The largest overall correlations are between attitude and points (r = .437) and between deep and surf (negative, r = -.324); correspondingly, the smallest overall correlations are between deep and points (negative, r = -.0101), attitude and age (r = .0222) and deep and age (r = .0251).
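Correlations like those quoted above come from cor(); here is a minimal, self-contained sketch (the learning2014 data is not bundled in this snippet, so a tiny made-up data frame stands in for it):

```r
# Sketch: computing a pairwise correlation like those quoted above.
# A small synthetic frame stands in for learning2014 here;
# with the real data you would use learning2014 in place of df.
df <- data.frame(
  attitude = c(37, 31, 25, 35, 37),
  points   = c(25, 12, 10, 24, 22)
)
r <- cor(df$attitude, df$points)   # Pearson correlation
round(r, 3)                        # about 0.917
```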
The target variable is y = points and the explanatory variables are x1 = attitude, x2 = stra and x3 = surf.
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
regr1 <- lm(points ~ attitude + stra +surf, data = learning2014)
summary(regr1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The t-value of attitude (5.913) is statistically significant (p < .001), so attitude explains the points variable well. The t-values of stra (1.575) and surf (-0.731) are not significant in this model, so I will remove the one with the smaller absolute t-value: the variable surf.
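The removal step can be done conveniently with update(); a hedged sketch on the built-in mtcars data, since learning2014 is not available in this snippet:

```r
# Backward elimination sketched on built-in mtcars data.
# With the real data, the same pattern would target regr1 and the surf term.
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
# drop one predictor and refit, as done above when removing surf
reduced <- update(full, . ~ . - qsec)
names(coef(reduced))   # "(Intercept)" "wt" "hp"
```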
regr2 <- lm(points ~ attitude + stra, data = learning2014)
summary(regr2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The p-value of the variable stra (p = .08927) is still > .05, so I will remove it from the model as well.
regr3 <- lm(points ~ attitude, data = learning2014)
summary(regr3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The relationship between attitude and points: with y = points and x = attitude, we can predict the value of y with the model:
y = 11.63715 + 0.35255x + e
The multiple R-squared of this model is 0.1906, so the model explains 19.06% of the variation in points.
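Plugging a hypothetical attitude score into the fitted equation gives a point prediction; for example, for attitude = 40 (an illustrative value, not one from the data):

```r
# Prediction from the fitted model above: points = 11.63715 + 0.35255 * attitude
b0 <- 11.63715
b1 <- 0.35255
attitude <- 40                 # hypothetical attitude score
pred <- b0 + b1 * attitude
round(pred, 2)                 # about 25.74
```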
par(mfrow = c(2,2))
plot(regr3, which = c(1,2,5))
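The numbers behind those three diagnostic plots (1 = residuals vs fitted, 2 = normal Q-Q, 5 = residuals vs leverage) can also be inspected directly; a sketch on the built-in cars data, since regr3's data is not bundled in this snippet:

```r
# Quantities plotted by plot(model, which = c(1, 2, 5))
fit <- lm(dist ~ speed, data = cars)
head(fitted(fit))      # fitted values (x-axis of plot 1)
head(residuals(fit))   # raw residuals (y-axis of plot 1)
head(rstandard(fit))   # standardized residuals (plots 2 and 5)
head(hatvalues(fit))   # leverage values (x-axis of plot 5)
```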
library(ggplot2); library(dplyr); library(GGally); library(boot)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First I'll read the alcohol data into R:
getwd()
## [1] "/Users/mirva/IODS-project"
setwd("/Users/mirva/IODS-project/data")
alc = read.csv("alcohol")
str(alc)
## 'data.frame': 382 obs. of 36 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This data includes 382 observations and 36 variables.
The data come from two identical questionnaires on secondary school students' alcohol consumption in Portugal:
Using Data Mining To Predict Secondary School Student Alcohol Consumption.
Fabio Pagnotta, Hossain Mohammad Amran.
Department of Computer Science, University of Camerino
You can find more info about the variables and data here
I will study the relationship between high/low alcohol consumption and four other variables: Pstatus, Medu, Fedu and famrel.
My personal hypotheses about these four variables and alcohol consumption are as follows:
Pstatus = parents' cohabitation status (living together or apart): my hypothesis is that students whose parents live together could have a lower risk of high alcohol consumption. Medu and Fedu = mother's and father's education: my hypothesis is that parents' lower education (and therefore possibly lower socioeconomic status) could increase the risk of high alcohol consumption.
famrel = quality of family relationships: my hypothesis here is that students whose family relationships are graded 4 or 5 could have a lower risk of high alcohol consumption.
alc %>% group_by(Pstatus, high_use) %>% summarise(count = n())
## Source: local data frame [4 x 3]
## Groups: Pstatus [?]
##
## Pstatus high_use count
## <fctr> <lgl> <int>
## 1 A FALSE 26
## 2 A TRUE 12
## 3 T FALSE 242
## 4 T TRUE 102
Among students whose parents' cohabitation status is A = "apart", 26 students (68%) show low alcohol use and 12 (32%) high use. Among students whose parents live together (status = T), 242 students (70%) show low use and 102 (30%) high use.
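The percentages above can be reproduced with prop.table() from the printed counts:

```r
# Row-wise shares of low/high use by parents' cohabitation status,
# rebuilt from the counts printed above.
tab <- matrix(c(26, 12, 242, 102), nrow = 2, byrow = TRUE,
              dimnames = list(Pstatus = c("A", "T"),
                              high_use = c("FALSE", "TRUE")))
round(prop.table(tab, margin = 1), 2)   # A: 0.68 / 0.32, T: 0.70 / 0.30
```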
alc %>% group_by(Medu, high_use) %>% summarise(count = n())
## Source: local data frame [10 x 3]
## Groups: Medu [?]
##
## Medu high_use count
## <int> <lgl> <int>
## 1 0 FALSE 1
## 2 0 TRUE 2
## 3 1 FALSE 33
## 4 1 TRUE 18
## 5 2 FALSE 80
## 6 2 TRUE 18
## 7 3 FALSE 59
## 8 3 TRUE 36
## 9 4 FALSE 95
## 10 4 TRUE 40
g1me <- ggplot(alc, aes(x = high_use, y = Medu))
g1me + geom_boxplot() + xlab("High alcohol consumption") + ylab("Mothers education")
Looking at the mother's education level, the counts above suggest there might be some differences between the groups, but the ns are quite low, especially in the no-education group. The boxplots of the two alcohol consumption groups look practically identical.
alc %>% group_by(Fedu, high_use) %>% summarise(count = n())
## Source: local data frame [9 x 3]
## Groups: Fedu [?]
##
## Fedu high_use count
## <int> <lgl> <int>
## 1 0 FALSE 2
## 2 1 FALSE 53
## 3 1 TRUE 24
## 4 2 FALSE 75
## 5 2 TRUE 30
## 6 3 FALSE 72
## 7 3 TRUE 27
## 8 4 FALSE 66
## 9 4 TRUE 33
g1fe <- ggplot(alc, aes(x = high_use, y = Fedu))
g1fe + geom_boxplot() + xlab("High alcohol consumption") + ylab("Fathers education")
The corresponding counts and boxplot for fathers' education are shown above. Here, too, contrary to my hypothesis, a father's higher education does not seem to imply a lower share of high alcohol use.
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## Source: local data frame [10 x 3]
## Groups: famrel [?]
##
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 10
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 135
## 8 4 TRUE 54
## 9 5 FALSE 78
## 10 5 TRUE 24
g1fr <- ggplot(alc, aes(x = high_use, y = famrel))
g1fr + geom_boxplot() + xlab("High alcohol consumption") + ylab("Quality of family relations")
Looking at the connection between alcohol consumption and the quality of family relations, the distributions across the grade groups give some support to my hypothesis: the better the family relations, the lower the alcohol consumption.
m <- glm(high_use ~ Pstatus + Medu + Fedu + famrel, data = alc, family = "binomial")
summary.glm(m)
##
## Call:
## glm(formula = high_use ~ Pstatus + Medu + Fedu + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1682 -0.8475 -0.7827 1.4193 1.7343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.10331 0.66634 0.155 0.8768
## PstatusT -0.04070 0.37567 -0.108 0.9137
## Medu -0.01669 0.13690 -0.122 0.9029
## Fedu 0.06193 0.13545 0.457 0.6475
## famrel -0.26540 0.12146 -2.185 0.0289 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 460.64 on 377 degrees of freedom
## AIC: 470.64
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) PstatusT Medu Fedu famrel
## 0.10331492 -0.04070326 -0.01669495 0.06193416 -0.26540413
Only the quality of family relations (famrel) is statistically significant in this model.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.1088405 0.2964832 4.0825324
## PstatusT 0.9601140 0.4688583 2.0693320
## Medu 0.9834436 0.7515044 1.2871504
## Fedu 1.0638923 0.8164830 1.3904537
## famrel 0.7668960 0.6034634 0.9732521
Every one of these confidence intervals includes 1 except famrel's, which suggests the other variables make no difference in the model.
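The odds ratio for famrel can also be reproduced by hand from the printed coefficients; note that the Wald interval below is only close to, not identical with, the profile-likelihood interval from confint() above:

```r
# Odds ratio for famrel from the printed model coefficients
b  <- -0.26540413   # famrel estimate from summary(m)
se <- 0.12146       # its standard error
exp(b)                          # odds ratio, about 0.767
exp(b + c(-1.96, 1.96) * se)    # approximate Wald 95% CI
```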
Removing the non-significant variables one by one does not make any of the remaining variables newly significant:
m2 <- glm(high_use ~ Medu + Fedu + famrel, data = alc, family = "binomial")
summary.glm(m2)
##
## Call:
## glm(formula = high_use ~ Medu + Fedu + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1715 -0.8458 -0.7703 1.4164 1.7327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.06539 0.56687 0.115 0.9082
## Medu -0.01521 0.13618 -0.112 0.9111
## Fedu 0.06190 0.13546 0.457 0.6477
## famrel -0.26612 0.12130 -2.194 0.0282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 460.65 on 378 degrees of freedom
## AIC: 468.65
##
## Number of Fisher Scoring iterations: 4
m3 <- glm(high_use ~ Fedu + famrel, data = alc, family = "binomial")
summary.glm(m3)
##
## Call:
## glm(formula = high_use ~ Fedu + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1732 -0.8400 -0.7666 1.4146 1.7244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.04792 0.54493 0.088 0.9299
## Fedu 0.05207 0.10288 0.506 0.6128
## famrel -0.26610 0.12129 -2.194 0.0282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 460.67 on 379 degrees of freedom
## AIC: 466.67
##
## Number of Fisher Scoring iterations: 4
So I'll keep only the variable famrel in the model:
m4 <- glm(high_use ~ famrel, data = alc, family = "binomial")
summary.glm(m4)
##
## Call:
## glm(formula = high_use ~ famrel, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1404 -0.8323 -0.7427 1.4483 1.6869
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1772 0.4812 0.368 0.7127
## famrel -0.2649 0.1212 -2.185 0.0289 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 460.92 on 380 degrees of freedom
## AIC: 464.92
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m4, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE
## FALSE 268
## TRUE 114
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g2 <- g + geom_point()
g2
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE Sum
## FALSE 0.7015707 0.7015707
## TRUE 0.2984293 0.2984293
## Sum 1.0000000 1.0000000
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
So the predictive power of my model is low (and more informative variables would probably be needed to improve it). Because every predicted probability is below 0.5, the model never predicts high use, so its predictions are no better than always guessing low use.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2984293
The mean (training) prediction error is approx. 0.298, meaning that about 30% of the predictions are wrong.
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cv$delta[1]
## [1] 0.3036649
The average prediction error in 10-fold cross-validation is approx. 0.304.
Another model, with a smaller prediction error than the model introduced in DataCamp, uses the variables absences, sex, famrel and goout. Its cross-validation prediction error is about 0.21.
m5 <- glm(high_use ~ absences + sex + famrel + goout, data = alc, family = "binomial")
summary.glm(m5)
##
## Call:
## glm(formula = high_use ~ absences + sex + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7151 -0.7820 -0.5137 0.7537 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76826 0.66170 -4.184 2.87e-05 ***
## absences 0.08168 0.02200 3.713 0.000205 ***
## sexM 1.01234 0.25895 3.909 9.25e-05 ***
## famrel -0.39378 0.14035 -2.806 0.005020 **
## goout 0.76761 0.12316 6.232 4.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.81 on 377 degrees of freedom
## AIC: 389.81
##
## Number of Fisher Scoring iterations: 4
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
cv$delta[1]
## [1] 0.2146597
setwd("/Users/mirva/IODS-project/")
library(MASS); library(corrplot); library(dplyr); library(ggplot2); library(GGally)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This data includes 506 observations and 14 variables: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv. You can find the descriptions of the variables here.
Drawing bar plots of the distributions of these variables isn't very useful, so I will compare them with pairs() and corrplot() instead.
pairs(Boston)
As you can see, the relationships between these variables vary: some are roughly linear (e.g. between age and dis), some are non-linear (e.g. between lstat and medv), and some variables don't seem to be related at all (e.g. black and rm).
An easier way to look at the correlations is a correlation plot.
cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
As you can see here, the colors and shades describe the direction and strength of the correlation between the variables: the darker the shade, the stronger the correlation. There seems to be a very strong positive correlation e.g. between rad and tax, and a strong negative correlation e.g. between age and dis. On the other hand, there are only very weak negative correlations e.g. between ptratio and black, zn and crim, and rm and rad.
From the correlation matrix we can read the exact correlation coefficients between these variables.
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
As you can see, e.g. the strong positive correlation between rad and tax, which we saw earlier in the correlation plot, has coefficient r = .91, and the weak negative correlation between ptratio and black has coefficient r = -.18.
I'll standardize the Boston dataset with the scale() function (scaled(x) = (x - mean(x)) / sd(x)), because linear discriminant analysis assumes that the variables are normally distributed and share the same variance.
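A quick check on a toy vector confirms that scale() applies exactly this formula:

```r
# scale() standardizes each column as (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
by_hand   <- (x - mean(x)) / sd(x)
via_scale <- as.vector(scale(x))
all.equal(by_hand, via_scale)   # TRUE
```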
boston_scaled <- scale(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Comparing the summaries of Boston and boston_scaled, we can see that every variable's mean is now 0 (so the variables also take negative values) and the ranges are much smaller.
First I'll convert boston_scaled to a data frame:
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
Then I'll save the scaled crime rate:
scaled_crim <- boston_scaled$crim
After that I'll use the quantiles as the break points and create a categorical variable "crime":
bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
I'll remove the original crim from the dataset and add the new "crime" variable to the scaled data:
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
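The quantile-based categorization can be illustrated on a toy vector: cut() with the quantiles as breaks puts roughly a quarter of the observations into each class.

```r
# cut() with quantile breaks, as used for the scaled crime rate above
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(x)   # min, 25%, 50%, 75%, max
cls <- cut(x, breaks = bins, include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(cls)            # two observations per class
```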
Now I'll divide the dataset into train and test sets, with 80% of the data in the train set:
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
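The split logic above can be sketched on a toy data frame; with a fixed seed the result is reproducible (the analysis above does not set a seed, so its split varies between runs):

```r
# An 80/20 train/test split with sample(), as above
set.seed(1)                      # fixed seed for reproducibility
df <- data.frame(id = 1:100)
ind <- sample(nrow(df), size = nrow(df) * 0.8)
train <- df[ind, , drop = FALSE]
test  <- df[-ind, , drop = FALSE]
c(nrow(train), nrow(test))       # 80 and 20
```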
Now I'll fit a linear discriminant analysis with the function lda(), using "crime" as the target variable and all the other variables as predictors:
lda.fit <- lda(crime ~ ., data = train)
Here is the LDA biplot of this analysis
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
I'll save the correct crime categories from the test set into "correct_classes":
correct_classes <- test$crime
After that I'll remove the categorical crime variable from the test dataset:
test <- dplyr::select(test, -crime)
Next I will predict the classes with the LDA model on the test data, using the function predict():
lda.pred <- predict(lda.fit, newdata = test)
Then I'll cross-tabulate the results:
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 7 2 0
## med_low 5 14 3 0
## med_high 0 12 16 1
## high 0 0 1 25
As you can see, the predicted classes are the columns and the correct classes the rows. Looking first at the row where the correct value is low, most of the predictions fall in the class low (n = 16), but quite a few are in med_low (n = 7) and a couple even in med_high (n = 2). The same goes for the correct med_low class: most predictions are med_low (n = 14), but some are low (n = 5) or med_high (n = 3). In the correct med_high class the model does worse: 16 observations are predicted med_high, but 12 are predicted med_low and 1 high. The correct high class is predicted best: 25 of its 26 observations are predicted high and only 1 med_high.
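The overall accuracy implied by the cross-tabulation above can be computed from its diagonal:

```r
# Confusion matrix rebuilt from the printed cross-tabulation
conf <- matrix(c(16,  7,  2,  0,
                  5, 14,  3,  0,
                  0, 12, 16,  1,
                  0,  0,  1, 25), nrow = 4, byrow = TRUE)
accuracy <- sum(diag(conf)) / sum(conf)   # correct predictions on the diagonal
round(accuracy, 3)                        # about 0.696
```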
First I'll reload the Boston data:
data("Boston")
After that I'll scale the variables to get comparable distances, as I did earlier in this exercise:
boston_scaled <- scale(Boston)
Then I'll calculate the distances between the observations by creating a Euclidean distance matrix from the scaled data:
dist_eu <- dist(boston_scaled)
First I'll run the k-means algorithm with 10 clusters:
km <-kmeans(dist_eu, centers = 10)
pairs(Boston, col = km$cluster)
The plot seems quite messy, so maybe there are too many clusters.
To investigate the optimal number of clusters, I will calculate the total within-cluster sum of squares (TWCSS):
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
Looking at the plot, the TWCSS drops radically when the number of clusters is two, so that might be the optimal number of clusters.
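The same elbow pattern can be demonstrated on a toy dataset with two well-separated clusters, where the TWCSS should drop sharply from k = 1 to k = 2 and flatten afterwards:

```r
# TWCSS elbow on simulated two-cluster data
set.seed(123)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # cluster around (0, 0)
             matrix(rnorm(40, mean = 6), ncol = 2))   # cluster around (6, 6)
twcss <- sapply(1:4, function(k) kmeans(pts, centers = k, nstart = 5)$tot.withinss)
twcss   # big drop from k = 1 to k = 2, little change after that
```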
I will run the k-means algorithm again
km <-kmeans(dist_eu, centers = 2)
Now if we draw the plot again, the clusters are much easier to interpret:
pairs(Boston, col = km$cluster)
As we can see, in many panels (e.g. age against indus) the red cluster seems quite scattered, while the black cluster shows a clearer pattern.
This week's data are part of the UN Development Programme's Human Development Reports, which you can find here. The data is combined from two different datasets: the Human Development Index and its components, and the Gender Inequality Index.
First I will load the 'human' data and then explore its structure and dimensions.
setwd("/Users/mirva/IODS-project/data")
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
As you can see, there are 155 observations and 8 variables in this data. The variables are as follows:
* Edu2.FM = the ratio of female to male populations with secondary education in each country (female/male)
* Labo.FM = the ratio of female to male labour force participation in each country (female/male)
* Life.Exp = life expectancy at birth (years)
* Edu.Exp = expected years of schooling
* GNI = gross national income (GNI) per capita
* Mat.Mor = maternal mortality ratio (deaths per 100 000 live births)
* Ado.Birth = adolescent birth rate (births per 1 000 women ages 15–19)
* Parli.F = share of seats in parliament (% held by women)
Now that you know what the variables are, let's look at the graphics and summaries of the data.
library(GGally); library(dplyr); library(corrplot)
ggpairs(human)
cor(human) %>% corrplot()
As you can see from the correlation plots above, some correlations are quite strong and some are very weak. The strongest correlations are between Mat.Mor and Life.Exp (r = -.857), Edu.Exp and Life.Exp (r = .789), Mat.Mor and Ado.Birth (r = .759), Life.Exp and GNI (r = .627) and Edu.Exp and GNI (r = .624). The weakest correlations are between Edu2.FM and Labo.FM (r = .00956), Labo.FM and GNI (r = -.0217), Labo.FM and Edu.Exp (r = .0473), Ado.Birth and Parli.F (r = -.0709) and Parli.F and Edu2.FM (r = .0786).
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
As we can see from the summaries, and also from the upper panel of the correlation plots above, the median of Edu2.FM is 0.9375, so the ratio of female to male populations with secondary education is almost 1, meaning a nearly equal level of education. The minimum, however, is 0.1717, meaning that in some country women rarely get secondary education compared to men. The maximum is 1.4967, meaning that in some countries women are more educated than men.
If we look at the summary of Labo.FM, the median is 0.7535, so men participate in the labour force more often than women. In many countries women stay at home taking care of the household, which might explain part of this result. It is interesting that the maximum is only 1.0380, which tells us that there is no country where women work markedly more often than men.
Edu.Exp seems to be quite normally distributed, and the median is 13.50 expected years of schooling. The 1st and 3rd quartiles are quite close to this, but the minimum is only 5.40, suggesting that some people probably won't go to school at all or only for a few years. The maximum of 20.20 years is also quite high, meaning that in some countries almost everyone gets at least secondary education and often higher.
The median of Life.Exp is 74.20 years. Quite shockingly, the minimum is only 49 years, which might be a consequence of, for example, high child mortality. The maximum is 83.50 years, so in some country the life expectancy is almost twice as long as in the country where it is 49 years.
As we can see, the range of GNI is huge. The minimum is only $581 and the maximum $123,124, while the median is $12,040. We can also see from the graph that GNI is far from a uniform (or even normal) distribution, so income is distributed quite unequally.
Almost the same shape of distribution holds for Mat.Mor, but here the minimum is 1 death, the median 49 deaths and the maximum 1100 deaths, with 50% of countries falling between 11.5 and 190 deaths. This means that in some country there is only 1 death per 100 000 live births (0.001%) and in another 1100 deaths (1.1%), so the differences are very big, although in most countries the death rate is quite low.
The median of Ado.Birth is surprisingly high, 33.60, which is 3.36% of women ages 15-19. The range is also quite big: the minimum is 0.60 and the maximum 204.80.
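The percentage figures above are just a change of denominator, which can be checked directly:

```r
# Converting the reported rates to percentages of their base populations
1100 / 100000 * 100  # worst maternal mortality: 1.1% of live births
33.60 / 1000 * 100   # median adolescent birth rate: 3.36% of women aged 15-19
```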
Parli.F tells a lot about equality between the genders. The minimum is 0, meaning that in some country there are no women in parliament. The maximum is 57.50, so in some country there are more women in parliament than men. The median is 19.30, so in most countries women are underrepresented in parliaments compared to men.
Because there are 8 variables, it would be hard to interpret the relations between them using only pairwise comparisons. That's why I will use principal component analysis (PCA), which reduces the data to a smaller number of components, each a combination of the original variables. I will perform the PCA first on the non-standardized human data and then on the standardized human data.
pca1_human <- prcomp(human)
s1 <- summary(pca1_human)
s1
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca1_pr <- round(100*s1$importance[2,], digits = 1)
pca1_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc1_lab <- paste0(names(pca1_pr), " (", pca1_pr, "%)")
biplot(pca1_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc1_lab[1], ylab = pc1_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca2_human <- prcomp(human_std)
s2 <- summary(pca2_human)
s2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca2_pr <- round(100*s2$importance[2,], digits = 1)
pca2_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc2_lab <- paste0(names(pca2_pr), " (", pca2_pr, "%)")
biplot(pca2_human, cex = c(0.8, 0.7), col = c("grey40", "deeppink2"), xlab = pc2_lab[1], ylab = pc2_lab[2])
As you can see, the two plots are very different. If we first look at the PCA on non-standardized data, the first component PC1 explains 99.99% of the variation and PC2 0.01%, so, as the cumulative proportion shows, together they explain 100% of the variation. Because the variables are measured in very different units, their variances differ hugely, and the PCA is really hard to interpret on non-standardized data.
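The domination of PC1 can be reproduced on toy data: when one column's variance dwarfs the others, unscaled PCA essentially just returns that column. A minimal sketch with made-up variables, where `huge` plays the role of GNI:

```r
# Sketch: unscaled PCA is driven by the highest-variance variable.
set.seed(2)
toy <- data.frame(small = rnorm(100, sd = 1),
                  huge  = rnorm(100, sd = 1000))  # mimics GNI's scale
p_raw <- prcomp(toy)                  # no scaling
p_std <- prcomp(toy, scale. = TRUE)   # standardized, as with human_std
summary(p_raw)$importance[2, 1]  # PC1 share near 1: all variance from `huge`
summary(p_std)$importance[2, 1]  # PC1 share near 0.5 after scaling
```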
If we then look at the PCA on standardized data, we can see that more components carry a meaningful proportion of the variance, and 100% is not reached until PC8. PC1 and PC2 together explain 69.84% of the variation. The plot is also now easier to interpret. Looking at the arrows, there are quite strong correlations between Edu.Exp, Life.Exp, Edu2.FM and GNI, and also between Mat.Mor and Ado.Birth, as we already noticed earlier. These six variables also correlate strongly with PC1. Parli.F and Labo.FM correlate most with each other and with PC2. We can also see that, for example, in Mozambique and Burundi the ratio of female to male labour force participation and the share of seats in parliament are quite high, as are the maternal mortality ratio and the adolescent birth rate, while the ratio of female to male populations with secondary education, expected years of schooling, life expectancy at birth and gross national income (GNI) per capita are quite low. Iran, for example, shows quite the opposite pattern.
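The arrow directions in the biplot come from the rotation (loading) matrix, so the groupings described above can also be checked numerically; with the real data you would inspect `pca2_human$rotation[, 1:2]`. A toy sketch:

```r
# Sketch: correlated variables get same-signed loadings on PC1.
set.seed(3)
d <- data.frame(a = rnorm(60))
d$b <- d$a + rnorm(60, sd = 0.3)  # correlated with a, like Edu.Exp/Life.Exp
d$c <- rnorm(60)                  # mostly unrelated, like Parli.F/Labo.FM
p <- prcomp(d, scale. = TRUE)
round(p$rotation[, 1:2], 2)  # a and b share a sign on PC1; c loads on PC2
```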
First, I will load the tea dataset from the FactoMineR package.
library(FactoMineR); library(tidyr); library(ggplot2)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
There are 300 observations and 36 variables in this data.
I chose to keep the following columns: tea.time, tearoom, Tea, How, sugar, how and where.
keep_columns <- c("tea.time", "tearoom", "Tea", "How", "sugar", "how", "where")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## tea.time tearoom Tea How
## Not.tea time:131 Not.tearoom:242 black : 74 alone:195
## tea time :169 tearoom : 58 Earl Grey:193 lemon: 33
## green : 33 milk : 63
## other: 9
## sugar how where
## No.sugar:155 tea bag :170 chain store :192
## sugar :145 tea bag+unpackaged: 94 chain store+tea shop: 78
## unpackaged : 36 tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 7 variables:
## $ tea.time: Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
As you can see from the summary, the chosen variables have the following options:

* tea.time: not tea time or tea time
* tearoom: not in a tearoom or in a tearoom
* Tea: black, Earl Grey or green
* How: alone, with lemon, with milk or other
* sugar: no sugar or with sugar
* how: with a tea bag, with both tea bags and loose leaf (unpackaged), or only loose leaf
* where: from a chain store, from both a chain store and a tea shop, or only from a tea shop
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
From the barplots we can see that most of the people have tea time, don't drink in a tearoom, drink Earl Grey, drink their tea plain without lemon or milk, use only tea bags and buy their tea from a chain store. Sugar use is nearly evenly split (145 with sugar vs. 155 without).
I will use Multiple Correspondence Analysis (MCA) to inspect the connections between the variables I chose.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.280 0.233 0.185 0.160 0.148 0.141
## % of var. 16.335 13.580 10.802 9.327 8.633 8.213
## Cumulative % of var. 16.335 29.915 40.717 50.044 58.677 66.890
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.131 0.122 0.101 0.089 0.072 0.053
## % of var. 7.635 7.100 5.880 5.205 4.191 3.099
## Cumulative % of var. 74.525 81.625 87.505 92.711 96.901 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.525 0.328 0.256 | 0.021 0.001 0.000 | -0.242 0.105
## 2 | -0.353 0.148 0.082 | -0.076 0.008 0.004 | -0.622 0.696
## 3 | -0.295 0.104 0.140 | -0.148 0.031 0.035 | -0.284 0.145
## 4 | -0.680 0.551 0.645 | -0.105 0.016 0.015 | 0.230 0.095
## 5 | -0.537 0.344 0.414 | -0.040 0.002 0.002 | -0.186 0.062
## 6 | -0.537 0.344 0.414 | -0.040 0.002 0.002 | -0.186 0.062
## 7 | -0.295 0.104 0.140 | -0.148 0.031 0.035 | -0.284 0.145
## 8 | -0.111 0.015 0.009 | -0.183 0.048 0.023 | -0.720 0.933
## 9 | 0.567 0.383 0.199 | -0.472 0.319 0.138 | 0.093 0.015
## 10 | 0.907 0.980 0.384 | -0.147 0.031 0.010 | -0.346 0.215
## cos2
## 1 0.054 |
## 2 0.255 |
## 3 0.129 |
## 4 0.074 |
## 5 0.050 |
## 6 0.050 |
## 7 0.129 |
## 8 0.359 |
## 9 0.005 |
## 10 0.056 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## Not.tea time | -0.505 5.684 0.198 -7.690 | 0.205 1.121 0.032
## tea time | 0.392 4.406 0.198 7.690 | -0.159 0.869 0.032
## Not.tearoom | -0.327 4.392 0.445 -11.538 | 0.046 0.104 0.009
## tearoom | 1.363 18.324 0.445 11.538 | -0.191 0.434 0.009
## black | 0.462 2.684 0.070 4.570 | 0.178 0.479 0.010
## Earl Grey | -0.114 0.425 0.023 -2.643 | -0.248 2.423 0.111
## green | -0.370 0.769 0.017 -2.250 | 1.050 7.443 0.136
## alone | -0.157 0.819 0.046 -3.704 | 0.146 0.849 0.040
## lemon | 0.516 1.492 0.033 3.134 | 0.259 0.453 0.008
## milk | -0.049 0.026 0.001 -0.439 | -0.401 2.069 0.043
## v.test Dim.3 ctr cos2 v.test
## Not.tea time 3.114 | 0.167 0.934 0.021 2.535 |
## tea time -3.114 | -0.129 0.724 0.021 -2.535 |
## Not.tearoom 1.620 | 0.016 0.016 0.001 0.572 |
## tearoom -1.620 | -0.068 0.068 0.001 -0.572 |
## black 1.760 | -0.986 18.487 0.318 -9.753 |
## Earl Grey -5.754 | 0.437 9.462 0.344 10.140 |
## green 6.383 | -0.343 1.000 0.015 -2.087 |
## alone 3.438 | -0.185 1.723 0.064 -4.368 |
## lemon 1.575 | 1.622 22.318 0.325 9.859 |
## milk -3.572 | -0.076 0.092 0.002 -0.674 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.time | 0.198 0.032 0.021 |
## tearoom | 0.445 0.009 0.001 |
## Tea | 0.076 0.169 0.375 |
## How | 0.150 0.106 0.372 |
## sugar | 0.070 0.012 0.393 |
## how | 0.462 0.610 0.028 |
## where | 0.560 0.692 0.106 |
Here we can see that dimensions 1 and 2 together cover 29.915% of the variance. From the last table (the squared correlation ratios, eta2) we can see how strongly each variable is linked to the first three dimensions.
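The eigenvalue table can be summarized with a scree plot. A sketch using the percentages printed in the summary above (base R only, no need to re-run the MCA):

```r
# Percentages of variance copied from the MCA summary above
pct <- c(16.335, 13.580, 10.802, 9.327, 8.633, 8.213,
         7.635, 7.100, 5.880, 5.205, 4.191, 3.099)
barplot(pct, names.arg = paste0("Dim.", seq_along(pct)),
        las = 2, ylab = "% of variance")
sum(pct[1:2])  # the first two dimensions cover 29.915%
```

The decrease is gradual rather than showing a sharp elbow, which is typical for MCA on binary indicator variables.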
plot(mca, invisible=c("ind"), habillage = "quali")
From the plot above we can see, for example, that buying unpackaged tea is closely related to buying tea from a tea shop, and buying only tea bags is closely related to buying tea from a chain store. Drinking Earl Grey and drinking tea with lemon are closely related, as are drinking black tea and drinking tea with milk.